Abstract

Early  pathogen  exposure  detection  allows  better  patient  care  and  faster  implementation 
of  public  health  measures  (patient  isolation,  contact  tracing).  Existing  exposure 
detection  most  frequently  relies  on  overt  clinical  symptoms,  namely  fever,  during  the 
infectious  prodromal  period.  We  have  developed  a  robust  machine  learning  method  to 
better  detect  asymptomatic  states  during  the  incubation  period  using  subtle,  sub-clinical 
physiological  markers.  Using  high-resolution  physiological  data  from  non-human 
primate  studies  of  Ebola  and  Marburg  viruses,  we  pre-processed  the  data  to  reduce 
short-term  variability  and  normalize  diurnal  variations,  then  provided  these  to  a 
supervised  random  forest  classification  algorithm.  In  most  subjects  detection  is 
achieved well before the onset of fever; subject cross-validation led to 52±14h mean
early detection (at >0.90 area under the receiver-operating characteristic curve). Cross-cohort
tests across pathogens and exposure routes also led to successful early detection
(28±16h  and  43±22h,  respectively).  We  discuss  which  physiological  indicators  are  most 
informative  for  early  detection  and  options  for  extending  this  capability  to  lower  data 
resolution and wearable, non-invasive sensors.

Introduction


We  have  developed  a  method  for  assessing  viral  exposure  based  solely  on  host 
physiological  signals,  in  contrast  to  conventional  diagnostics  based  on  fever  or 
biomolecules  [1]  of  the  pathogen  itself  or  the  host's  immune  response.  Early  warning  of 
pathogen  exposure  has  many  advantages:  earlier  patient  care  increases  the  probability 
of  a  positive  prognosis  [2-5]  and  faster  public  health  measure  deployment,  such  as 
patient  isolation  and  contact  tracing  [6-8],  which  reduces  transmission  [9].  Following 
pathogen  exposure,  there  exists  a  "pre-symptomatic"  incubation  phase  where  overt 
clinical  symptoms  are  not  yet  present  [10].  This  incubation  phase  can  vary  from  days  to 
years  depending  on  the  virus  [11,12],  and  is  reported  to  be  3-25  days  for  many 
hemorrhagic fevers [3,4,13,14]. Following this incubation phase, the prodromal period
is marked by non-specific symptoms such as fever, rash, loss of appetite, and
hypersomnia [10]. Figure 1 shows a conceptual model of the probability of infection
detection Pd during different post-exposure periods (incubation, prodrome, and virus-specific
symptoms) for current specific and non-specific (i.e., symptoms-based)
diagnostics. We also include what may be considered an "ideal" sensor system capable
of detecting viral exposure even during the earliest incubation period. We hypothesize
that quantifiable abnormalities (versus a diurnal baseline, for instance) in high-resolution
physiological signals, such as those from electrocardiography,
hemodynamics, and temperature, before overt clinical signs could be a basis for the ideal
signal in Figure 1, thereby providing advance warning (the early warning time, Δt) of
on-coming illness.

In  addition  to  characteristic  clinical  presentations,  most  infectious  disease  diagnosis  is 
based  upon  identification  of  pathogen-specific  molecular  signatures  (via  culture, 
PCR/RT-PCR  or  sequencing  for  DNA  or  RNA,  or  immunocapture  assays  for  antigen  or 
antibody) in a relevant biological fluid [10,15-22]. Exciting new approaches allowed by
high-throughput  sequencing  have  shown  the  promise  of  pre-symptomatic  detection 
using  genomic  [23,24]  or  transcriptional  [25]  expression  profiles  in  the  host  [26]. 
However,  these  approaches  suffer  from  often  prohibitively  steep  logistic  burdens  and 
associated costs (cold chain storage, equipment requirements, highly qualified
operators, serial sampling): indeed, most infections presented clinically are never
definitively  determined  etiologically,  much  less  serially  sampled.  Furthermore, 
molecular  diagnostics  are  rarely  used  until  patient  self-reporting  and  presentation  of 
overt  clinical  symptoms,  such  as  fever.  Past  physiological  signal-based  early  infection 
detection  work  has  been  heavily  focused  on  bacterial  infection  [27-32],  and  largely 
centered  upon  higher  time  resolution  analysis  of  body  core  temperature  [32,33], 
advanced  analyses  of  strongly-confounded  signals  such  as  heart  rate  variability  [28-30] 
or social dynamics [34], or sensor data fusion from already symptomatic (febrile) virally
infected individuals [35]. While great progress has been made in developing techniques
for  signal-based  early  warning  of  bacterial  infections,  we  are  unaware  of  any  effort  in 
extending  these  techniques  to  possibly  life-threatening  viral  infections. 

Electronics  miniaturization  has  led  to  a  wave  of  wearable  sensing  technologies  for 
health  monitoring  [36],  and  increasingly  more  processing  power  is  available  to 
consumers to make meaningful use of these collected data [37]. Inspired by these
developments, we envision a low-profile, robust, wearable, personalized and multi-modal
physiological monitoring system persistently measuring signals capable of
sensitive  pathogen  infection  detection.  Such  a  system  could  cue  the  use  of  highly 
specific  (but  expensive)  diagnostic  tests,  prompt  low-regret  responses  such  as  patient 
isolation  and  observation,  or  advise  clinicians  of  fulminant  complications  in  already 
compromised  patients. 

We  use  high-resolution  physiological  data  from  non-human  primates  (NHPs)  exposed 
via  intramuscular  (IM)  or  aerosol  routes  to  either  of  two  viral  hemorrhagic  fevers  (Ebola 
virus  [EBOV]  and  Marburg  virus  [MARV])  to  build  this  novel  high  sensitivity,  low 
etiological  specificity  (that  is,  not  informative  of  particular  pathogens)  processing  and 
detection  algorithm.  Data  is  normalized  to  remove  diurnal  rhythms,  aggregated  to 
reduce  short-term  fluctuations,  and  then  provided  to  a  supervised  binary  classification 
(exposed  and  unexposed)  machine  learning  algorithm  as  illustrated  in  Figure  2(a).  We 
tested and compared several methods; random forests (RFs) had the best positive predictive value
(discussed below) and were chosen for the rest of our analysis. RFs were also chosen
for their robustness against feature-rich and noisy data while minimizing over-fitting.
RFs are grown (trained) at two post-exposure stages, allowing the
algorithms to adapt to physiological changes between incubation and prodromal
phases. One RF is trained using pre-fever physiological data and the other using post-fever
data. Both RFs include pre-exposure data to build the unexposed class. Subject
data  is  separated  into  training  and  testing  sets,  and  every  testing  subject's  data  is 
provided  to  the  RF  model  for  an  exposure  prediction  every  30  min.  After  using  binary 
integration  and  a  constant  false  alarm  thresholding  approach  to  further  reduce  false 
alarms,  mean  exposure  declaration  times  are  found  to  range  from  21h  (for  EBOV)  to  69h 
(for  MARV)  before  the  onset  of  fever  defined  as  1.5°C  above  a  diurnal  baseline  [38] 
sustained  for  two  hours.  Figure  2(b)  shows  a  block  diagram  of  the  declaration  process. 
We  note  that  all  physiological  data  is  given  to  our  algorithm  without  regard  to  exposure 
or  fever  status;  in  other  words,  our  approach  does  not  require  information  on  exposure 
or  fever  times  for  successful  classification  and  detection. 
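
As an illustration of the fever definition used above (1.5°C above a diurnal baseline, sustained for two hours), the following is a minimal sketch rather than the study's implementation. It assumes one temperature sample per 30-minute update, so that two hours corresponds to four consecutive samples, and that a per-sample diurnal baseline is already available; the function and parameter names are hypothetical.

```python
from typing import Optional, Sequence

def fever_onset_index(
    temps_c: Sequence[float],
    baseline_c: Sequence[float],
    delta_c: float = 1.5,      # excursion above baseline, from the text
    sustain_steps: int = 4,    # assumption: 4 x 30 min = 2 h
) -> Optional[int]:
    """Return the index at which the first sustained excursion begins."""
    run = 0
    for i, (temp, base) in enumerate(zip(temps_c, baseline_c)):
        # Count consecutive samples at or above baseline + delta_c.
        run = run + 1 if temp - base >= delta_c else 0
        if run >= sustain_steps:
            return i - sustain_steps + 1
    return None  # no sustained fever in this record

baseline = [37.0] * 10
temps = [37.0, 38.6, 37.0, 38.6, 38.7, 38.8, 38.9, 37.0, 37.0, 37.0]
onset = fever_onset_index(temps, baseline)
```

In this toy record the isolated excursion at index 1 is ignored because it is not sustained; onset is assigned to the start of the four-sample run.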

Implementing  this  type  of  early-warning  algorithm  could  save  lives  of  health  care 
workers,  military  service  members,  patients,  and  other  susceptible  individuals.  During 
the  2014  West  Africa  Ebola  outbreak,  for  instance,  health  care  workers  at  higher  risk  of 
viral  exposure  could  have  been  monitored  persistently  for  the  earliest  possible 
indications  of  viral  exposure.  More  commonly,  patients  in  post-operative  or  critical 
care  units  could  be  monitored  for  infection  and  treated  well  before  clinical  symptoms, 
viremia/bacteremia, or septic shock [39]. In future, etiologically-specific iterations of
this  approach,  knowledge  of  causative  pathogens  could  inform  very  early  therapeutic 
intervention.  Furthermore,  using  very  feature  sparse  datasets,  such  as  those  that  could 
be  collected  using  wearable  sensor  platforms,  would  enable  this  technique  to  be 
implemented  in  non-ideal  clinical,  athletic,  and  military  environments.  Transitioning 
this  technology  to  these  contexts  is  the  focus  of  ongoing  work. 


Results 

We  use  high  resolution  physiological  data  collected  during  previously  conducted 
natural  history  studies  (presently  unpublished)  at  the  United  States  Army  Medical 
Research  Institute  of  Infectious  Diseases  (USAMRIID)  to  build  a  binary  classification 
random forest machine learning model [40] for detecting whether an animal had been
exposed to a viral hemorrhagic fever virus (either Ebola or Marburg virus). Supervised
machine learning algorithms observe characteristics in data that belong to pre-determined
classes, then place new, unseen data into the appropriate class based on
similar  characteristics.  Here,  we  define  pre-  and  post-exposure  as  the  two  classes  since 
"infection"  is  not  a  discrete  event. 

We experimented with several classification methods, including Naive Bayes [41], k-Nearest
Neighbors [42], and random forests (RFs), and compared each across
sensitivity, specificity, and early warning time metrics. All classifiers had positive
predictive values, yet we chose RFs for several reasons (results for other classifiers are
found in Supplementary Figure 1). Most importantly, RFs require no
assumptions  about  the  statistical  independence  of  features,  which  is  critically  useful 
given  highly  correlated  physiological  feature  sets.  RFs  also  allow  for  the  calculation  of 
quantitative  feature  performance;  this  also  facilitates  post-hoc  comparison  to  the  known 
viral  pathology  sequence  to  mechanistically  understand  why  these  physiological 
anomalies  are  present.  Furthermore,  the  most  discriminating  features  can  be  selectively 
chosen  to  re-grow  forests  and  allow  for  better  algorithm  performance  with  fewer 
feature  inputs.  Next,  a  collection  of  trees  within  each  model  grown  on  different  subsets 
of  the  full  training  set  prevents  over-fitting  (which  is  commonly  seen  in  single  decision 
trees)  and  reduces  variance.  Finally,  in  empirical  comparisons  of  many  machine 
learning  methods,  RFs  consistently  rank  among  the  best  approaches  [43],  and  we  too 
found  RFs  to  produce  the  best  outputs  among  the  classifiers  tested.  We  employ  RFs  for 
both  cross-study  and  intra-study  validations  using  different  testing  and  training 
datasets  (details  in  Methods). 

Before  analysis,  several  data  pre-processing  steps  are  required  to  remove  time  as  an 
implicit  feature  in  our  physiological  datasets.  First,  data  is  normalized  and  aggregated 
subject-by-subject  to  eliminate  short-term  fluctuations  and  daily  diurnal  rhythms.  From 
these  normalized  datasets,  mean  and  quantiles  are  calculated  for  adjacent  30  minute 
time  windows  (see  Figure  3);  these  first-  and  second-order  statistical  measures  are  the 
features  provided  to  the  machine  learning  algorithm.  Two  RF  models  are  trained  to 
detect  the  post-exposure  class  at  distinct  time  epochs:  one  model  is  tuned  to  detect 
subtle data markers during the incubation phase prior to fever, while the second model
is tuned for the early prodromal phase (i.e., onset of overt febrile symptoms) where
temperature-related  features  emerge  as  powerful  discriminants.  The  training  data  for 
the  pre-exposure  class  for  both  models  is  a  subset  of  baseline  data  prior  to  challenge 
and  the  quantity  of  training  data  has  been  balanced  for  the  negative  (pre-exposure)  and 
positive  (post-exposure)  classes  to  avoid  biasing  one  class  over  the  other.  For  the  rest  of 
our  analysis,  data  from  12h  before  and  24h  after  challenge  are  excluded  from 
performance  metrics  due  to  differences  in  animal  handling  and  sedation  for  exposure. 
Additional  details  on  data  pre-processing  and  algorithm  development  may  be  found  in 
the  Methods  section. 
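
The pre-processing described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure from the Methods section: the diurnal baseline is taken to be the mean pre-exposure value in each 30-minute time-of-day window, features are the mean and quartiles of the baseline-subtracted signal, and all names are hypothetical.

```python
import statistics
from collections import defaultdict

WINDOW_MIN = 30            # window length from the text
MIN_PER_DAY = 24 * 60

def diurnal_baseline(baseline_samples):
    """Mean pre-exposure value for each time-of-day window.

    baseline_samples: iterable of (minutes_since_first_midnight, value).
    """
    by_bin = defaultdict(list)
    for minute, value in baseline_samples:
        by_bin[(minute % MIN_PER_DAY) // WINDOW_MIN].append(value)
    return {b: statistics.fmean(v) for b, v in by_bin.items()}

def window_features(samples, baseline):
    """Subtract the diurnal baseline, then summarize each adjacent window."""
    residuals = defaultdict(list)
    for minute, value in samples:
        tod_bin = (minute % MIN_PER_DAY) // WINDOW_MIN
        # Assumes the baseline covers every time-of-day window.
        residuals[minute // WINDOW_MIN].append(value - baseline[tod_bin])
    features = {}
    for window, values in sorted(residuals.items()):
        if len(values) < 2:
            continue  # quartiles need at least two samples
        q25, q50, q75 = statistics.quantiles(values, n=4)
        features[window] = {
            "mean": statistics.fmean(values),
            "q25": q25, "median": q50, "q75": q75,
        }
    return features

baseline = diurnal_baseline([(0, 36.0), (1440, 38.0)])
feats = window_features([(2880, 38.0), (2890, 39.0), (2900, 40.0)], baseline)
```

Because each value is expressed relative to the same time of day at baseline, the window index carries no physiological information of its own, which is the sense in which time is removed as an implicit feature.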

One  output  of  RFs  is  a  measure  of  relative  feature  importance;  that  is,  which  features 
provide  the  most  accurate  separation  between  classes.  The  most  discriminating  features 
for  the  pre-fever  and  post-fever  RF  models  vary  among  four  feature  types  derived  from 
temperature,  ECG,  blood  pressure,  and  respiration  measurements.  (See  Supplementary 
Table  1  and  Supplementary  Figures  2-3  for  a  complete  listing  of  most  discriminating 
features.)  The  algorithm  reports  features  that  follow  clinical  symptomology,  namely 
that  core  temperature  in  the  post-fever,  prodromal  model  is  the  highest  ranking  in 
feature  importance.  Before  fever,  however,  subtle  ECG  and  blood  pressure  derived 
features  seem  to  be  the  highest  ranking  in  feature  importance,  as  is  observed  at  the 
earliest  stages  of  sepsis  [28-31]  (further  noted  in  the  Discussion  below). 


Intra-study  (3-fold)  cross-validations 

Intra-study tests (i.e., trained and tested with data from the same NHP study) are used
first for testing our model's accuracy and early warning capability. RFs are built using
two-thirds of the subjects then tested for each subject in the remaining one-third of left-out
subjects; the process is repeated three times, allowing each subject in the study to be
evaluated once as part of a left-out test data set. The resulting outcomes for all subjects
are combined and evaluated in subsequent performance metrics. Three-fold cross-validation
is chosen due to limited study sample sizes. These intra-study tests
necessarily  hold  constant  factors  such  as  animal  species,  exposure  route,  and  virus,  and 
thus provide insight into the model's performance when these factors are known or are
constant. Figure 4 shows representative examples of our algorithm's output for each
intra-study test. Every 30 minutes, the combined score (see Figure 2) of the pre- and
post-fever forests is plotted, representing the a posteriori probability that the subject is in
the "exposed" class. In other words, values closer to 1.0 indicate a higher confidence
prediction for a subject having been exposed to the virus. Qualitatively, we note that
most subjects' scores rise around the challenge time (though data 12h before and 24h
after exposure are disregarded). To quantify performance, we calculate probability of
infection detection Pd and probability of false declaration Pfa for the collection of system
outputs (updated every 30 minutes). Associated with these are the 95% confidence
intervals for a standard Gaussian. In cases where no false declarations were made
within the study sample, we provide an upper bound on Pfa. For the MARV IM study
(system Pd=0.95±0.008, Pfa<0.002, mean Δt=74.5±6.0h), the scores rise sharply after challenge
and remain high throughout the remainder of the study. The MARV aerosol (system
Pd=0.79±0.02, Pfa<0.002, mean Δt=44.4±26.1h) and EBOV aerosol (system Pd=0.65±0.02,
Pfa=0.01±0.005, mean Δt=23.0±30.3h) studies show moderate elevations at challenge time and
fluctuate the first few days before rising sharply 12 to 24h before acute fever (vertical
red  line).  This  behavior  can  be  explained  by  trends  in  the  individual  forest  scores.  The 
pre-fever  forest  is  trained  on  data  with  subtle,  sub-clinical  changes  from  pre-exposure 
baseline  which  become  more  obvious  and  detectable  in  the  hours  leading  up  to  fever 
onset  (and  when  the  animal  is  anesthetized  prior  to  challenge).  Variability  in  the 
combined  score  before  fever  can  be  understood  both  by  considering  the  individual 
animal's  immune  response  to  the  pathogen,  and  the  inter-individual  variability  of  this 
response  when  training  the  algorithm  across  subjects.  Furthermore,  variability  in  the 
pre-fever  results  and  lower  early  warning  time  for  the  EBOV  study  may  be  due  to  a 
much lower target exposure dose (100 pfu target) than either of the MARV studies
(1000 pfu target). After febrile symptoms, the post-fever forest dominates the score as it
indicates  a  strong  and  easily  detectable  deviation  from  the  baseline  and  is  how  current 
clinical  diagnosis  is  largely  based. 
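
The subject-wise 3-fold scheme described in this section can be sketched as below. It is an illustrative split only, assuming subjects are assigned to folds at random; the subject identifiers and the `subject_folds` helper are hypothetical. The key property is that all windows from one animal fall on the same side of the train/test boundary.

```python
import random

def subject_folds(subject_ids, k=3, seed=0):
    """Yield (train_subjects, test_subjects) so each subject is tested once."""
    ids = sorted(subject_ids)
    random.Random(seed).shuffle(ids)           # reproducible assignment
    folds = [ids[i::k] for i in range(k)]      # k roughly equal groups
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test

subjects = [f"nhp{i}" for i in range(9)]       # hypothetical subject labels
splits = list(subject_folds(subjects))
```

Splitting by subject rather than by time window avoids training and testing on data from the same animal, which would otherwise inflate apparent performance.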

To quantitatively assess whether a subject has been exposed, we use a false positive
threshold method (details in Methods section) to build a binary decision from the RF
models, then employ a binary integration step to make a final declaration that a
subject is exposed. These two steps afford much greater sensitivity and specificity than
relying on RF model score outputs alone [44]. Briefly, using baseline pre-exposure data,
we threshold scores from each RF model and make an 'initial detection' decision every
30 minutes. Next, we perform binary integration, which counts the positive
detections observed in the past n time steps. At each time step, if this count is
greater than or equal to m (here we used m=11 and n=24),
we output a 'declaration' that the subject is in the exposed class at that time step. We
find threshold values for each RF model by sweeping across a series of possible
thresholds from 0 to 1. For each threshold in the series, the proportion of false
declarations Pfa is calculated using 3-fold cross validation, in the same manner as
described for the RFs. Thresholds are estimated as the smallest value for each RF model
that supported a desired constant false positive level (here we chose Pfa=0.01). Figure 4
shows our algorithm's combined score output (after thresholding and binary
integration, see Figure 2b), declarations (green triangles), and onset of fever (red vertical
lines) for three representative subjects in each study. The time between our algorithm's
first true declaration (green line) and fever onset is defined here as the early warning
time (Δt). (Each subject's early warning times, as well as additional algorithm
performance parameters, may be found in Supplementary Table 2.) As with Pd and Pfa,
we report 95% confidence intervals associated with each Δt. However, unlike Pd and Pfa,
the number of trials for early warning time is small (20 subjects per test at most), so the
confidence intervals are based on t-distributions with the degrees of freedom equal to
the number of subjects minus 1.
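
The thresholding and m-of-n binary integration steps can be sketched as follows. This is a simplified illustration that uses a single score stream and estimates Pfa directly on one baseline record, whereas the text describes per-model thresholds estimated with 3-fold cross-validation. The values m=11, n=24, and the target Pfa=0.01 come from the text; the function names and the threshold grid are assumptions.

```python
def declarations(scores, threshold, m=11, n=24):
    """Declare 'exposed' when at least m of the last n scores exceed threshold."""
    detections = [s > threshold for s in scores]   # initial detections
    declared = []
    for i in range(len(detections)):
        window = detections[max(0, i - n + 1): i + 1]
        declared.append(sum(window) >= m)
    return declared

def pick_threshold(baseline_scores, target_pfa=0.01, steps=100, m=11, n=24):
    """Smallest threshold whose false declaration rate on baseline data
    does not exceed target_pfa (constant-false-alarm-style selection)."""
    for k in range(steps + 1):
        threshold = k / steps
        declared = declarations(baseline_scores, threshold, m, n)
        if sum(declared) / len(declared) <= target_pfa:
            return threshold
    return 1.0

# Toy example: 11 high scores followed by low scores.
decl = declarations([0.9] * 11 + [0.1] * 30, threshold=0.5)
thr = pick_threshold([0.2] * 100)
```

With 30-minute updates, n=24 spans 12 hours, so a declaration requires elevated scores in roughly half of the preceding half-day; isolated spikes in the score do not trigger a declaration.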

The early warning capacity for these intra-study tests demonstrates the ability to find
meaningful Δt values when the animal species, exposure route, and viral agent are
known. We can imagine a context where such information is known, such as a
healthcare worker experiencing an accidental needle stick in a known outbreak, or a
laboratory employee after an accidental protective equipment breach. However, most
exposures  will  occur  when  many  of  these  variables  are  unknown  or  impossible  to 
know,  which  emphasizes  the  need  to  experiment  with  testing  and  training  our 
algorithm  across  these  variables. 

Cross-study Validations

Next, we used cross-study validations to indicate our model's extensibility beyond a
given  animal  model,  pathogen,  or  exposure  route.  In  one  version  of  a  cross-study 
validation,  all  data  from  one  NHP  study  are  used  to  train  RF  models,  all  data  from  a 
second  study  are  used  to  test  that  model,  and  an  identical  false  positive  thresholding 
and binary integration method for detection/declaration as used above is applied.
Algorithm outputs and detection plots are interpreted identically as in the intra-study
validation tests. Figure 5 shows one representative subject's output for each of the
(train/test) MARV intramuscular/MARV aerosol (system Pd=0.81±0.02, Pfa=0.04±0.01,
mean Δt=42.5±22.1h) and EBOV aerosol/MARV aerosol (system Pd=0.72±0.02,
Pfa=0.01±0.005, mean Δt=28.3±16.2h) cross-study validations. These combinations are chosen
to  hold  the  pathogen  and  exposure  routes  constant,  respectively.  The  EBOV  aerosol  / 
MARV  aerosol  validation  test  also  uses  studies  with  different  target  dose  exposure 
levels,  which  may  explain  the  lower  early  warning  time;  despite  this,  we  still  observe 
nearly  1  day  of  early  warning. 

In  another  version  of  a  cross-study  validation,  we  tested  the  most  generalized  scenario 
where all data across all three studies are used to test and train an RF model. In this
aggregate study, where neither the species of animal, exposure route, virus, nor target dose is
held constant, we find a system Pd=0.80±0.01, Pfa<0.0005, and mean Δt=52.8±12.9h. These
results  strongly  suggest  that  our  model  is  not  limited  to  particular  viruses  or  exposure 
routes,  but  rather  is  capable  of  indicating  a  general  patho-physiological  state  during  the 
viral  incubation  period  in  NHPs. 
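
The ± values reported for Pd, Pfa, and Δt are 95% confidence half-widths. A minimal sketch of both calculations is given below, assuming a normal approximation for proportions and Student's t for the small-sample mean early warning time. The t critical values are standard two-sided 95% table entries for up to 19 degrees of freedom (20 subjects); the function names are hypothetical and the inputs in the example are made up.

```python
import math
import statistics

# Two-sided 95% Student's t critical values, indexed by degrees of freedom.
T_975 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447,
         7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228, 11: 2.201, 12: 2.179,
         13: 2.160, 14: 2.145, 15: 2.131, 16: 2.120, 17: 2.110,
         18: 2.101, 19: 2.093}

def proportion_halfwidth(p, n_trials, level=0.95):
    """Gaussian-approximation half-width for a proportion such as Pd or Pfa."""
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return z * math.sqrt(p * (1 - p) / n_trials)

def mean_halfwidth_t(values):
    """Return (mean, 95% half-width) using t with len(values) - 1 dof."""
    n = len(values)
    half = T_975[n - 1] * statistics.stdev(values) / math.sqrt(n)
    return statistics.fmean(values), half

hw = proportion_halfwidth(0.5, 100)            # made-up inputs
mean_dt, hw_dt = mean_halfwidth_t([10.0, 20.0, 30.0])
```

The proportion intervals are narrow because every 30-minute update counts as a trial, whereas each subject contributes only one early warning time, which is why the Δt intervals are comparatively wide.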


Evaluating  Algorithm  Performance 

We  evaluated  our  algorithm's  performance  by  analyzing  the  probability  of  detection 
(Pd,  i.e.,  correctly  declaring  a  subject  as  being  exposed  after  the  viral  challenge)  versus 
false positives (Pfa, i.e., incorrectly declaring a subject as exposed before the viral
challenge), known as a receiver operating characteristic (ROC) curve [45]. ROC curves
describe the sensitivity (Pd) and specificity (1-Pfa; n.b., not informative of the causative
agent) of a test and can be partially summarized by the area under the curve (AUC). An
AUC  of  1.0  refers  to  a  perfectly  sensitive  and  specific  detector,  whereas  a  value  of  0.5 
indicates that the test cannot distinguish between classes and is no better than a coin-flip.
Figure 6 shows ROC curves for the MARV aerosol intra-study, the MARV
IM/MARV  aerosol  cross-study  tests,  and  the  aggregate  study  test  using  all  available 
data;  additional  ROC  curves  for  all  intra-study  and  cross-study  validations  can  be 
found in Supplementary Figures 4-5. We conclude that, for each intra-study and
cross-study test, the pre-fever AUC is >0.90, and thus each pre-fever RF model has
significant discriminating power for early detection (details in Supplementary Table 3).
All post-fever RF models have AUC values approaching one, indicating nearly perfect
performance during febrile symptoms, as may be expected given such a clear anomaly
compared  to  baseline  values. 
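
For reference, the AUC can be computed without tracing the curve, using its standard interpretation as the probability that a randomly chosen post-exposure score exceeds a randomly chosen pre-exposure score, with ties counted as one half. The sketch below is a generic illustration with made-up scores, not the evaluation code used in this study.

```python
def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0      # positive ranked above negative
            elif p == q:
                wins += 0.5      # ties count as half
    return wins / (len(pos_scores) * len(neg_scores))

perfect = roc_auc([0.9, 0.8], [0.1, 0.2])
chance = roc_auc([0.5], [0.5])
partial = roc_auc([0.8, 0.3], [0.5, 0.1])
```

The pairwise form is quadratic in the number of scores, which is acceptable for illustration; a rank-based computation gives the same value more efficiently on large score sets.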

Perhaps  the  most  clinically  useful  metric  of  our  algorithm  is  the  early  warning  time, 
defined  as  the  time  difference  between  our  algorithm's  first  correct  'declaration'  and  the 
onset  of  fever  (1.5°C  above  a  diurnal  baseline  [38]  sustained  for  two  hours).  Another 
useful  metric  from  an  algorithm  development  perspective  is  the  ROC  AUC  for  different 
subsets of study data collected before fever (i.e., the interval where early warning is
meaningful).  This  pre-fever  ROC  AUC  provides  a  robust  metric  for  performance 
comparisons  both  across  studies  and  evaluating  system  design  trade-offs  such  as 
reduced  feature  sets,  as  discussed  below. 
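
The early warning time defined above can be computed from a subject's declaration sequence as sketched below. The sketch assumes 30-minute time steps, treats any declaration at or after the challenge index as correct, and ignores declarations before challenge; the names are hypothetical.

```python
def early_warning_hours(declared, challenge_idx, fever_onset_idx, step_hours=0.5):
    """Hours from the first post-challenge declaration to fever onset.

    Positive values mean the declaration preceded fever; None means the
    subject was never declared exposed after challenge.
    """
    for i in range(challenge_idx, len(declared)):
        if declared[i]:
            return (fever_onset_idx - i) * step_hours
    return None

declared = [False] * 10 + [True] * 10
declared[2] = True                      # a false declaration before challenge
dt = early_warning_hours(declared, challenge_idx=4, fever_onset_idx=16)
```

A negative result would indicate that the first correct declaration came only after fever onset.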

Extending  to  non-invasive  monitoring  platforms 

Physiological  data  features  provided  to  our  algorithm  were  collected  using  surgically 
implanted  monitoring  devices;  such  data  could  never  be  expected  from  military  service 
members,  health  care  workers  responding  to  an  outbreak,  hospital  patients,  or  the 
general  public.  As  an  in  silico  simulation  for  limiting  our  dataset  to  what  may  be 
collected  using  a  wearable-type  monitoring  device,  we  reduced  the  considered  feature 
set  to  include  only  certain  subsets:  ECG-only,  ECG  and  temperature,  heart  rate  and 
temperature,  temperature  alone,  and  heart  rate  alone.  Successful  use  of  ECG  data  as  a 
predictor of physiological compensatory potential during shock has been reported
[28,29]. Ambulatory Holter monitor devices collect exactly this type of data [46], as do
even  less  obtrusive  devices  for  performance  athletes.  Figure  7  shows  algorithm  output 
for  one  representative  subject  and  ROC  curves  for  the  MARV  aerosol  study  using  this 
ECG-only  feature  subset  (including  RR,  QRS,  PR,  and  QT  intervals;  the  relative 
importance  of  each  feature  is  shown  in  Supplementary  Figure  6).  Although  the 
sensitivity  of  this  ECG-only  algorithm  decreases  slightly  relative  to  the  baseline  feature 
set (Pd=0.78±0.02 at Pfa<0.001 vs. Pd=0.79±0.02 at the same Pfa), the mean early
warning time of 51.1±23.0h is still very clinically useful. Results for other reduced
feature  subsets  of  the  MARV  aerosol  study  are  provided  in  Figure  8  and  additional 
feature  importance  metrics  and  corresponding  ROC  curves  may  be  found  in 
Supplementary  Figures  6-7. 
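
Restricting the feature set in silico amounts to filtering the per-window feature vector before training and testing. The sketch below illustrates this with a hypothetical naming convention in which each feature name begins with its sensor modality; neither the names nor the prefixes are taken from the study data.

```python
# Hypothetical modality prefixes for the reduced feature sets named in the text.
SUBSETS = {
    "ecg_only": ("ecg",),
    "ecg_temp": ("ecg", "temp"),
    "hr_temp": ("hr", "temp"),
    "temp_only": ("temp",),
    "hr_only": ("hr",),
}

def select_features(row, subset):
    """Keep only the features whose modality prefix belongs to the subset."""
    prefixes = SUBSETS[subset]
    return {name: value for name, value in row.items()
            if name.split("_", 1)[0] in prefixes}

row = {"ecg_rr_mean": 0.1, "ecg_qt_q75": -0.3, "temp_core_mean": 0.4,
       "hr_mean": 2.0, "bp_systolic_mean": -1.2}
ecg_row = select_features(row, "ecg_only")
```

The same trained pipeline can then be re-run per subset to compare pre-fever AUC and early warning time against the full feature set.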


Discussion 

Non-biochemical  detection  of  viral  incubation  periods  using  only  physiological  data 
presents  a  fundamentally  new  approach  to  infectious  disease  care.  Previous  work  has 
shown that reducing transmission during the viral incubation period is as effective as, or more
effective than, reducing the inherent transmissibility (R0) of the pathogen in
controlling  emerging  outbreaks  [9].  However,  there  is  no  existing  method  to  detect  this 
pre-symptomatic  incubation  period  extensible  to  mobile  settings  or  wearable  sensor 
systems.  We  present  the  first  attempt  to  build  a  multi-modal  machine  learning 
algorithm  capable  of  determining  this  incubation  period  using  physiological  signals  of 
NHPs infected with viral hemorrhagic fevers. The random forest method limits
over-fitting by construction, and successful testing on subjects and studies held out
from training supports this. Further, cross-study validations show
the  promise  of  extending  this  approach  beyond  a  given  animal  model,  exposure 
method,  or  virus.  All  intra-study  and  cross-study  validations  resulted  in  positive  mean 
early  warning  times,  with  times  that  would  be  actionable  (>20h)  for  intervention  or 
other preventive measures. While we chose a target system Pfa≈0.01 that was supported
by the limited subject numbers in the studies available, this would not lead to an
acceptable daily false alarm rate. Reducing this critical system parameter to more
clinically-acceptable levels (we estimate Pfa ~10^-3 or less) is the subject of on-going work,
and  may  require  larger  sample  sizes  or  more  refined  processing  algorithms. 
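
To make the false alarm concern concrete, the per-update Pfa can be converted to a daily rate. The sketch below assumes one declaration opportunity every 30 minutes (48 per day) and, for the second function, that updates are statistically independent. Independence is a crude assumption, since binary integration makes successive declarations strongly correlated, so the second figure is only a rough bound.

```python
UPDATES_PER_DAY = 48  # one declaration opportunity every 30 minutes

def expected_false_declarations_per_day(pfa):
    """Expected number of false declaration updates per subject per day."""
    return pfa * UPDATES_PER_DAY

def prob_any_false_declaration_per_day(pfa):
    """Chance of at least one false declaration in a day.

    Assumes independent updates, which overstates the effective number
    of independent opportunities after binary integration.
    """
    return 1.0 - (1.0 - pfa) ** UPDATES_PER_DAY

rate_now = expected_false_declarations_per_day(0.01)
rate_goal = expected_false_declarations_per_day(0.001)
```

Under these assumptions, Pfa=0.01 corresponds to about 0.48 false declaration updates per subject per day, while Pfa=0.001 reduces this to about 0.048.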

We postulate that immuno-biological events - particularly systemic release of pro-inflammatory
chemokines and cytokines from infected phagocytes [47-51], as well as
afferent  signaling  to  the  central  nervous  system  [52,53]  -  are  recapitulated  in 
hemodynamic,  thermoregulatory,  or  cardiac  signals  which  may  be  more  easily 
measured  and  assessed  than  biomolecule  markers  for  viral  infection  (via  sequencing 
[23,24,26]  or  immunocapture  approaches  [15,16]).  For  instance,  prostaglandins  (PG)  are 
up-regulated  upon  infection  (including  EBOV  [54,55])  and  intricately  involved  in  the 
non-specific  "sickness  syndrome"  [56];  the  PGs  are  also  known  to  be  potent  vascular 
mediators  [57]  and  endogenous  pyrogens  [58,59].  Past  work  has  clarified  how  tightly 
integrated,  complex,  and  oscillating  biological  systems  can  become  uncoupled  [60-62] 
during  trauma  [63]  or  critical  illness  [31,64]  which  would  be  captured  in  the 
comprehensive,  multi-modal  physiological  datasets  used  in  our  present  study. 
Rigorously  pursuing  this  hypothesis  would  require  additional  high  temporal  resolution 
datasets,  including  high-resolution  biochemical,  immunological,  neurological,  and 
cardiovascular  information. 

Previous work on genomic [23,24] profiles of peripheral blood cells following acute
influenza infection indicates specific host responses at just ~45h following exposure,
corresponding to ~35h of early warning time. Our combined results suggest that the
classic understanding of an asymptomatic incubation phase may be incomplete: during
viral incubation, subtle sub-clinical cues (genomic, transcriptional, and
physiological)  can  be  detectable  with  sufficiently  high-resolution  sensor  and  analysis 
systems.  Better  understanding  of  how  biomolecular  changes  are  captured  in  systemic 
physiological  signals  during  viral  infection  would  open  further  opportunities  for  better 
therapeutic  administration  both  before  and  during  infection,  quarantine  or  isolation, 
and  vaccine  development. 

Detecting  pathogen  exposure  before  self-reporting  or  overt  clinical  symptoms  affords 
great  opportunities  in  clinical  care  and  public  health  measures.  However,  given  the 
consequences of using some of these interventions and the lack of etiological agent
specificity in our algorithm, we envision our current approach to be a trigger for
'low-regret' actions rather than necessarily guiding medical care. For instance, using our
high  sensitivity  approach  as  an  alert  for  limited  high  specificity  confirmatory 
diagnostics  (such  as  sequencing  or  PCR-based)  could  lead  to  considerable  cost  savings 
(an  "alert-confirm"  system).  Public  health  response  following  a  bioterrorism  incident 
could  also  benefit  from  triaging  those  exposed  from  the  "worried  well."  Ongoing  work 
focuses  on  adding  enough  causative  agent  specificity  to  discern  between  bacterial  and 
viral  pathogens;  even  this  binary  classification  would  be  of  use  for  front-line  therapeutic 
or  mass  casualty  uses.  Eventually,  we  envision  a  system  that  could  give  real-time 
prognostic  information,  even  before  obvious  illness,  guiding  patients  and  clinicians  in 
diagnostic or therapeutic use with better time resolution than ever before.

Methods


Viruses 

The Marburg Angola isolate used was USAMRIID challenge stock "R17214" (Marburg
virus/H.sapiens-tc/ANG/2005/Angola-1379c). Cynomolgus macaques were exposed
to Ebola virus/H.sapiens-tc/COD/1995/Kikwit-9510621 (EBOV) at a target dose of 100
pfu (7U EBOV; USAMRIID challenge stock "R4415"; GenBank # KT762962).

Description  of  Studies 

Dr.  William  Pratt  provided  physiological  data  in  NSS  format  (Notocord,  Inc.)  from 
studies  previously  conducted  at  the  United  States  Army  Medical  Research  Institute  of 
Infectious  Diseases  (USAMRIID).  Research  was  conducted  under  an  IACUC  approved 
protocol  in  compliance  with  the  Animal  Welfare  Act,  PHS  Policy,  and  other  Federal 
statutes  and  regulations  relating  to  animals  and  experiments  involving  animals.  The 
facility  where  this  research  was  conducted  is  accredited  by  the  Association  for 
Assessment  and  Accreditation  of  Laboratory  Animal  Care,  International  and  adheres  to 
principles  stated  in  the  Guide  for  the  Care  and  Use  of  Laboratory  Animals,  National 
Research Council, 2011. In each study, remote telemetry devices (Konigsberg
Instruments, Inc., T27F for MARV studies and T37F for 3 subjects in the EBOV study,
and Data Sciences International Inc., L11 for 3 subjects in the EBOV study) were
implanted 3 to 5 months before exposure, and, if used, a central venous catheter was
implanted 2 to 4 weeks before exposure. NHPs were transferred into BSL4 containment
5 to 7 days before viral exposure, and baseline pre-exposure data were collected for
4 to 6 days before exposure. Subjects were
exposed  under  sedation  via  either  aerosol  or  intramuscular  injection  depending  on  the 
study.  The  exposure  time  used  in  our  model  is  based  upon  the  time  of  intramuscular 
injection  or  when  a  subject  was  returned  to  the  cage  following  aerosol  exposure  (~20 
min).  All  subjects  were  monitored  until  death  or  the  completion  of  the  study.  The 
devices measure several raw physiological signals, which were translated to blood
pressure (sampling frequency f = 250 Hz), ECG (f = 500 Hz), temperature (f = 50 Hz), and
pulmonary (f = 50 Hz) features. We analyzed data from three separate studies, detailed
in Table 1.

Physiological Data Pre-processing

Physiological data is time dependent (that is, sequential data) and is subject to
short-term fluctuations and daily diurnal rhythms. RF classifiers, however, require
time- and subject-independent data. To reduce diurnal and subject dependencies in the
data, each subject is pre-processed individually. The first step is to estimate baseline
diurnal statistics of the data by computing a mean, $\mu_i$, and standard deviation,
$\sigma_i$, for 30-minute intervals $i = 1, \ldots, 48$ over an average 24-hour
pre-exposure period. The data for that time of day is normalized by subtracting the mean
and dividing by the standard deviation, $(x_i(j) - \mu_i)/\sigma_i$, for each data sample
$j$ in the $i$-th interval. Data are then partitioned into sequential $k$-minute blocks
and aggregated by calculating a set of three summary statistics on each block: mean and
25% and 75% quantiles. These summary
statistics  calculated  on  each  time-independent  signal  are  the  input  features  for  the 
random forest algorithm. For example, 30-minute blocks for two days of 4 raw
physiological signals yield 96 time points, each with 12 data features (4 signals x 3
summary statistics). Although the
normalization  period  and  aggregation  blocks  (k)  are  not  required  to  be  the  same,  we 
have  chosen  a  common  interval  of  30  minutes  for  both.  Data  samples  that  correspond  to 
measurements  before  challenge  are  labeled  "0"  to  denote  the  pre-exposed  class  and 
those  after  challenge  are  labeled  "1"  to  denote  the  post-exposure  class. 
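The pre-processing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the function names (`diurnal_baseline`, `normalize`, `aggregate`), the use of minutes since the start of recording as the time unit, the population standard deviation, and the "inclusive" quantile method are all choices made here for illustration. It also assumes that every 30-minute interval contains at least two pre-exposure samples with nonzero spread.

```python
import statistics

BIN_MIN = 30                     # width of each diurnal interval, in minutes
BINS_PER_DAY = 1440 // BIN_MIN   # 48 intervals per 24 h


def diurnal_baseline(times_min, values, exposure_min):
    """Mean and standard deviation per time-of-day interval, pre-exposure only."""
    bins = [[] for _ in range(BINS_PER_DAY)]
    for t, x in zip(times_min, values):
        if t < exposure_min:
            bins[int(t % 1440) // BIN_MIN].append(x)
    # Assumption: every interval holds at least two samples with nonzero spread.
    return [(statistics.fmean(b), statistics.pstdev(b)) for b in bins]


def normalize(times_min, values, baseline):
    """Apply (x - mu_i) / sigma_i using the interval i each sample falls in."""
    z = []
    for t, x in zip(times_min, values):
        mu, sigma = baseline[int(t % 1440) // BIN_MIN]
        z.append((x - mu) / sigma)
    return z


def aggregate(times_min, z, exposure_min, block_min=30):
    """Summarize each k-minute block as (label, mean, 25% quantile, 75% quantile)."""
    blocks = {}
    for t, v in zip(times_min, z):
        blocks.setdefault(int(t // block_min), []).append(v)
    rows = []
    for b in sorted(blocks):
        vals = blocks[b]
        q25, _, q75 = statistics.quantiles(vals, n=4, method="inclusive")
        label = 0 if b * block_min < exposure_min else 1  # 0 = pre-exposure, 1 = post-exposure
        rows.append((label, statistics.fmean(vals), q25, q75))
    return rows
```

Applying the three functions to each raw signal and concatenating the per-block statistics gives one feature row per 30-minute block, for example 12 features per row for 4 signals.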

Random  Forest  Algorithm 

Our model is composed of two random forests (RFs): one RF (the pre-fever RF) is grown
using training data prior to fever onset and an equal number of randomly chosen negative
data samples from the pre-exposure class, and the other (the post-fever RF) is grown
analogously using training data after fever onset. Since the number of subjects in each
study is very
small, we do not have a separate validation set. However, test data is always held out
until  the  final  evaluation  step.  Each  RF  contains  50  classification  decision  trees  grown 
on  random  subsets  of  data  and  features.  The  trees  cast  their  "votes"  for  class  "0"  or  "1", 
and  the  forest  returns  the  class  with  the  most  votes.  This  process  helps  prevent 
overfitting, to which single decision trees are prone. RFs are particularly well suited
to calculating feature importance metrics, and we use these metrics to find the most
predictive features for hard-to-classify (pre-fever) days. Initially all features are
considered, but once the subset of the 15 most predictive features is determined within a
cross-validation training set, all RFs are regrown (same training set) on this 15-feature
subset
to  produce  the  final  models  upon  which  the  corresponding  cross-validation  testing  set 
performance  results  are  based.  Relative  importance  scores  for  each  of  the  top  15 
features  from  each  study  are  provided  in  the  Supplementary  Materials. 
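The bookkeeping around the forests (class balancing, selection of the 15 most important features, and majority voting) can be sketched as follows. This sketch deliberately does not grow any decision trees; it assumes that importance scores and per-tree votes come from whatever RF implementation is used. The function names, the tie-breaking rule for a 25/25 vote, and the choice to cap the sample size at the smaller class are assumptions made here, not details given in the text.

```python
import random


def balanced_training_set(positives, negatives, seed=0):
    """Pair the positive samples with an equal-sized random draw of negatives."""
    rng = random.Random(seed)
    k = min(len(positives), len(negatives))  # assumption: cap at the smaller class
    chosen_neg = rng.sample(negatives, k)
    chosen_pos = positives if k == len(positives) else rng.sample(positives, k)
    samples = chosen_pos + chosen_neg
    labels = [1] * len(chosen_pos) + [0] * len(chosen_neg)
    return samples, labels


def top_features(importances, k=15):
    """Names of the k features with the highest importance scores."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:k]


def forest_vote(tree_votes):
    """Majority class over per-tree votes (0 or 1); ties go to class 0 here."""
    return 1 if 2 * sum(tree_votes) > len(tree_votes) else 0
```

In this reading, each RF would be grown once on all features, `top_features` would be applied to its importance scores, and the forest would then be regrown on the same training set restricted to the selected columns.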

Model Performance Evaluation: Cross-Study and Intra-Study Validations

Model  performance  may  be  evaluated  by  separating  subjects  into  testing  and  training 
sets.  We  conduct  two  modes  of  evaluation:  cross-study,  where  testing  and  training  data 
are  from  different  studies  (and  thus  can  vary  in  subject  species,  virus,  and  exposure 
route),  and  intra-study,  where  both  testing  and  training  datasets  are  from  the  same 
study  (with  constant  subject  species,  pathogen,  and  exposure  routes)  thus  allowing 
model evaluation across individuals. We used a 3-fold cross-validation for the
intra-study tests by randomly assigning subjects into three partitions. Subjects from two
of those partitions form the training set to build the model, while one subject at a time
from the held-out partition is tested against that model. Model building and subject
testing are repeated for all subjects in a study. Most cross-study evaluations used all data
from  one  study  to  train  the  model,  and  all  subjects  of  another  study  are  tested  using  that 
model.  In  the  aggregated  cross-study  validation,  we  used  a  3-fold  cross-validation  just 
as  with  the  intra-study  tests,  including  random  assignment  of  subjects  into  the  three 
partitions.  Each  partition  included  subjects  from  each  of  the  three  studies. 
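The subject-wise partitioning described above can be illustrated with a short sketch. The function names and the fixed random seed are hypothetical, and the round-robin assignment within each study is only one way to guarantee that every partition contains subjects from every study.

```python
import random


def subject_folds(subjects, n_folds=3, seed=0):
    """Randomly assign subjects to n_folds partitions of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]


def cv_splits(subjects, n_folds=3, seed=0):
    """Yield (training subjects, single test subject) pairs, one per subject."""
    folds = subject_folds(subjects, n_folds, seed)
    for i, held_out in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        for test_subject in held_out:
            yield train, test_subject


def stratified_folds(subjects_by_study, n_folds=3, seed=0):
    """Partitions for the aggregated test: each fold draws from every study."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for study in sorted(subjects_by_study):
        members = list(subjects_by_study[study])
        rng.shuffle(members)
        for i, subject in enumerate(members):
            folds[i % n_folds].append(subject)
    return folds
```

Splitting by subject rather than by time sample keeps all data from a test subject out of the model that evaluates it.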

False  Positive  Thresholding,  Binary  Integration  and  Algorithm  Performance  Metrics 

We  make  declarations  of  exposure  using  a  two-stage  detection  process  (see  Figure  2).  In 
stage  one  of  the  detection  process,  RF  model  prediction  scores  (between  0  and  1  for 
every  30  minute  interval)  are  thresholded  (i.e.,  a  value  of  1  is  returned  if  the  RF  model 
score is greater than or equal to the threshold, estimated as described below) to form a series of initial
detections  for  the  model  every  30  minutes. 

These  initial  detections  from  each  RF  model  are  subjected  to  a  second-stage  detection 
test  to  further  reduce  the  false  alarm  rate.  During  the  second  stage,  binary  integration  is 
performed over a sliding window of the past n initial detections. The accumulated
detections are normalized by n, giving a mean score for the pre- and post-fever RF
models. Next, scores are combined by taking the maximum of the pre- or post-fever
values to create a single time series. At each 30 minute time interval, this combined
score is compared to a final declaration threshold of m / n, where m < n (we selected n=24
for a system latency of no more than 12 hours and selected m=11, which approximates
the optimum binary integration threshold for a steady signal in noise [65]; performance
is relatively insensitive to small deviations in m or n). The algorithm makes a
'declaration'  that  the  subject  is  in  the  exposed  class  when  the  combined  score  is  greater 
than  or  equal  to  m  /  n;  if  the  threshold  is  not  met,  the  algorithm  assigns  the  subject  to  the 
'not  exposed'  class  for  that  time  epoch.  Note  that  n  samples  are  required  before  a 
declaration  can  be  made,  so  following  the  start  of  data  collection  or  the  end  of  an 
exclusion  period  (the  24h  period  following  the  challenge),  no  declarations  are  reported 
in the first 30n minutes (for n=24, this accumulation period effectively extends the
exclusion  period  to  36  hours  post-challenge). 
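A minimal sketch of the two-stage detection logic follows, assuming one score per 30-minute interval from each RF. The function names are hypothetical, and `None` is used here to mark intervals in which fewer than n samples have accumulated and no declaration can be made.

```python
def first_stage(scores, threshold):
    """Stage one: 1 where the RF score meets or exceeds the threshold, else 0."""
    return [1 if s >= threshold else 0 for s in scores]


def binary_integrate(detections, n=24):
    """Mean of the past n initial detections; None until n samples exist."""
    out = []
    running = 0
    for i, d in enumerate(detections):
        running += d
        if i >= n:
            running -= detections[i - n]  # drop the sample leaving the window
        out.append(running / n if i >= n - 1 else None)
    return out


def declare(pre_scores, post_scores, thr_pre, thr_post, m=11, n=24):
    """Stage two: combine both models by maximum, then apply the m/n threshold."""
    pre = binary_integrate(first_stage(pre_scores, thr_pre), n)
    post = binary_integrate(first_stage(post_scores, thr_post), n)
    declarations = []
    for a, b in zip(pre, post):
        if a is None or b is None:
            declarations.append(None)
        else:
            declarations.append(1 if max(a, b) >= m / n else 0)
    return declarations
```

With n=24 and m=11, a sustained run of initial detections produces a declaration at the 11th consecutive detection, that is, about 5.5 hours after the run begins, provided the window is already full.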

Threshold  levels  for  the  pre-  and  post-fever  RFs  are  estimated  by  analyzing  false  alarm 
rates  (Type  I  errors)  of  the  final  declarations  versus  threshold  levels  (swept  from  0  to  1). 
We  define  the  probability  of  false  alarm  (or  Pfa)  as 

$$P_{fa} = \frac{\#\,\text{False Positives}}{\#\,\text{True Negatives} + \#\,\text{False Positives}}$$

To enforce a desired significance level (we choose Pfa = α = 0.01), we evaluate Pfa for the
final  declarations  for  subjects  in  the  current  partition  and  estimate  the  smallest 
threshold  needed  in  the  stage-one  detection  shown  in  Figure  2b.  This  approach  is 
repeated  for  three  partitions  in  each  study,  resulting  in  independent  estimates  of  the 
threshold  pair  (pre-  and  post-fever)  for  each  partition.  While  the  desired  Pfa  =  0.01,  the 
final  overall  system  Pfa  may  be  higher  or  lower. 
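The threshold sweep can be sketched for a single RF as below. The text does not specify how the pre- and post-fever thresholds are chosen jointly, so this example treats one model in isolation; the grid of 101 thresholds, the function names, and the restriction to negative (pre-exposure) scores are assumptions of the sketch. It requires at least n negative samples.

```python
def false_alarm_rate(neg_scores, threshold, m=11, n=24):
    """Pfa of the final declarations on negative (pre-exposure) scores."""
    det = [1 if s >= threshold else 0 for s in neg_scores]
    fp = tn = 0
    for i in range(n - 1, len(det)):
        count = sum(det[i - n + 1 : i + 1])  # initial detections in the window
        if count >= m:
            fp += 1   # declared "exposed" although the data are pre-exposure
        else:
            tn += 1
    return fp / (fp + tn)


def smallest_threshold(neg_scores, alpha=0.01, steps=100, m=11, n=24):
    """Smallest threshold on a 0..1 grid whose Pfa does not exceed alpha."""
    for k in range(steps + 1):
        thr = k / steps
        if false_alarm_rate(neg_scores, thr, m, n) <= alpha:
            return thr
    return None  # no grid threshold meets the target
```

Because raising the threshold can only remove initial detections, Pfa is non-increasing in the threshold, so the first grid value that meets the target is also the smallest.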

To evaluate system-level performance, we define probability of correct declaration Pd as:

$$P_d = \frac{\#\,\text{True Positives}}{\#\,\text{True Positives} + \#\,\text{False Negatives}}$$

and Pfa as above, where the True Positives, False Positives, True Negatives and False
Negatives are evaluated on the final declaration outputs of Figure 2b. When reporting Pd
and Pfa for a study, we include the 95% confidence interval based on standard normal
distributions since the number of trials per study is large (>2000). Although some
correlation is likely within a binary integration window of 30n minutes, we assume
independence for trials separated by at least 30n minutes. We generate receiver
operating  characteristic  (ROC)  curves  to  measure  system  performance  by  calculating  Pd 
vs  Pfa  at  a  series  of  threshold  values  (sweeping  the  first-stage  detection  threshold  but 
holding  the  second-stage  m  /  n  threshold  constant)  and  quantify  the  system  performance 
with  the  ROC  area  under  the  curve  (AUC),  where  an  AUC=1.0  indicates  perfect 
performance  and  AUC=0.5  indicates  that  the  model  is  no  better  than  a  coin  toss. 
Sensitivity  (Pd)  is  expected  to  be  highest  after  febrile  symptoms  are  apparent.  To 
distinguish  the  sensitivity  of  the  system  during  the  pre-  and  post-fever  epochs,  Pd  is 
calculated  independently  for  subsets  of  positive  data  that  occur  before  and  after  the 
onset  of  fever.  The  result  is  two  ROC  curves  and  corresponding  AUCs:  one  evaluated 
on positive data restricted to pre-fever time samples and the other restricted to
post-fever time samples. The negative data and two-stage detection process are identical for
both  ROC  curves. 
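These metrics can be computed with a few helper functions, sketched here under stated assumptions. The confidence interval uses the usual normal approximation for a proportion, which is one reading of "based on standard normal distributions"; the trapezoidal rule and the added (0, 0) and (1, 1) end points for the AUC are likewise choices of this sketch, and all function names are hypothetical.

```python
import math


def detection_rates(declarations, labels):
    """Return (Pd, Pfa) from final declarations and true labels (0 or 1)."""
    tp = fp = tn = fn = 0
    for d, y in zip(declarations, labels):
        if d is None:          # no declaration possible in this interval
            continue
        if y == 1:
            tp += d
            fn += 1 - d
        else:
            fp += d
            tn += 1 - d
    return tp / (tp + fn), fp / (fp + tn)


def rate_with_ci(successes, trials, z=1.96):
    """Proportion with a 95% normal-approximation confidence interval."""
    p = successes / trials
    half = z * math.sqrt(p * (1 - p) / trials)
    return p, max(0.0, p - half), min(1.0, p + half)


def roc_auc(points):
    """Trapezoidal area under (Pfa, Pd) points, with (0, 0) and (1, 1) added."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

To obtain the separate pre-fever and post-fever curves, `detection_rates` would be called twice per threshold with the positive samples restricted to the corresponding epoch and the negative samples left unchanged.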

In a clinically or military-deployed early-warning system, it may be desirable to
calculate Pd and Pfa on a per-device or per-day basis. However, for this proof-of-concept
study, the limited pool of subjects available (N = 20 total) necessitates calculating Pd and
Pfa across all 30-minute test points that are not in the exclusion window (12h before and
24h  after  exposure).  This  approach  includes  false  negatives  that  may  occur  after  an 
initial  early-warning  declaration  is  made,  and  thus  provides  a  conservative  estimate  of 
the  device  sensitivity  which  we  predict  will  further  increase  with  larger  sample  sizes 
and  more  refined  processing  algorithms. 

Another important measure of system performance is the mean early warning time.
The early warning time for an individual subject is defined as the time of fever onset
(defined as 1.5°C above a diurnal baseline [38] sustained for two hours) minus the time
of the first true declaration (excluding data from the 24 h interval immediately following
the challenge), so that a declaration made before fever onset gives a positive value.
Early warning times vary across subjects in a study, so the mean value
is calculated across all subjects to characterize the early warning time afforded by the
system.

| Virus | Exposure method | Subjects | Species | Monitoring system | Target dose (pfu) |
| --- | --- | --- | --- | --- | --- |
| EBOV | Aerosol | 6 | Cynomolgus | 3 subjects with ITS T37F; 3 subjects with DSI L11 | 100 |
| MARV | Aerosol | 5 | Rhesus | ITS T27F | 1000 |
| MARV | IM | 9 | Cynomolgus | ITS T27F | 1000 |

Table  1:  Summary  of  NHP  studies  used.  The  EBOV  study  compared  two  different 
physiological  monitoring  systems  but  data  was  combined  and  treated  identically. 
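The early warning time defined in the Methods (fever onset minus first true declaration) can be sketched as follows. The sketch assumes one temperature deviation and one declaration per 30-minute interval, times measured in minutes from the start of recording, "sustained for two hours" read as four consecutive 30-minute intervals, and fever onset taken as the start of that run. These readings and the function names are assumptions, not details stated in the text.

```python
def fever_onset_min(temp_dev_c, step_min=30, rise_c=1.5, sustain_min=120):
    """Start time (min) of the first run at or above rise_c lasting sustain_min."""
    need = sustain_min // step_min
    run = 0
    for i, dev in enumerate(temp_dev_c):
        run = run + 1 if dev >= rise_c else 0
        if run == need:
            return (i - need + 1) * step_min
    return None  # no fever detected


def early_warning_hours(declarations, onset_min, exposure_min,
                        step_min=30, exclusion_min=1440):
    """Fever onset minus first declaration after the 24 h exclusion period."""
    for i, d in enumerate(declarations):
        t = i * step_min
        if d == 1 and t >= exposure_min + exclusion_min:
            return (onset_min - t) / 60
    return None  # subject never declared exposed
```

A positive value means the declaration preceded fever onset; averaging the per-subject values gives the mean early warning time for a study.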


Figure 1: Notional schematic of the probability of detection (Pd) for current
symptoms-based detection (red curve) and an ideal signal (green curve) versus time (viral
exposure at t=0), overlaid with a typical evolution of symptoms. An ideal sensor and
analysis system would be capable of detecting exposure for a given Pd (and probability
of false alarm, Pfa) during the incubation period ($t_{ideal}$), well before the non-specific
symptoms of the prodrome ($t_{fever}$). We define the difference
$\Delta t = t_{fever} - t_{ideal}$ as the early
warning time (details below).

[Figure 2, panel (b) labels: combined score versus time, with the second-stage threshold (m/n) and declarations marked.]

Figure 2: Workflow of our (a) classification approach using random forests and (b)
block diagram of a two-stage detection algorithm to reduce false alarms. The detection
scheme comprises two distinct stages: after the random forest model score output, an a
priori determined threshold (based on a desired Pfa) is applied to yield initial detections.
These are then subjected to a binary integration step of the past n samples, and the
maximum value of the pre- and post-fever models is taken to produce a single time
series. A second stage m of n detection is applied, which produces a final
'declaration' of being exposed or not. See Methods for detailed descriptions.

Figure 3: [...] every 30 min from one subject in the MARV aerosol study. The blue curves
indicate an average diurnal value for this subject before exposure. Same (c) temperature
and (d) heart rate data after normalization and calculation of mean, standard deviation,
and quantiles. Vertical red lines indicate the onset of fever, defined here as 1.5°C above
the diurnal baseline sustained for 2h. These data are the features provided to the
classification algorithm.

Figure 4: Representative single subject combined scores for the intra-study validations:
(a) MARV aerosol, (b) MARV IM, and (c) EBOV aerosol, plotted against time (days). Scores
are updated and plotted at 30 minute intervals and declarations (green triangles) are made
when the score exceeds the m/n (11/24) threshold. Declarations before exposure (t=0) are
false positives; scores after exposure below the dashed horizontal threshold line are false
negatives. The time between the green and red vertical lines is the early warning time
afforded by our algorithm. Note that data 12h before and 24h after exposure is
disregarded due to animal anesthesia during exposure.

Figure 5: Examples of single subject algorithm outputs and declarations, plotted against
time (days), after false positive thresholding for two cross-study validations (training
set/testing set): (a) MARV IM/MARV aerosol, which use the same pathogen, and (b) EBOV aerosol /
MARV aerosol, which holds the exposure route constant.

Figure 6: ROC curves and sensitivity vs. time plots for (a,b) MARV aero intra-study
validation, (c,d) MARV IM/MARV aerosol cross-study, and (e,f) aggregated study
validation tests. Nearly perfect algorithm performance is seen in the febrile prodrome,
with only slightly lower performance during the incubation period. (b), (d), and (f)
show the percent of subjects correctly declared as "exposed" as a function of time before
fever for the MARV aero intra-study (false detection rate Pfa<0.001), MARV IM/MARV
aerosol cross-study (Pfa=0.04±0.01), and aggregated cross-study (Pfa<0.0005) validation
tests, respectively. The green vertical lines represent the mean early warning time for
the entire study.

Figure 7: Using only ECG-derived features from the MARV aerosol study to train the
model, one representative subject's (a) combined score output, (b) corresponding ROC
curves for the entire study, and (c) percentage of correctly declared subjects versus early
warning time. Since core body temperature is no longer available to the algorithm,
model performance during the febrile prodrome is slightly worse than during the pre-fever
incubation period. Furthermore, while the overall AUC performance drops relative to
the feature sets shown above, this limited set could be collected entirely using wearable
monitoring devices.

Figure 8: Box-and-whisker plot summarizing the early warning times from both intra-study
and cross-study validations. White vertical lines indicate the median value for each
study, boxes show the first and third quartiles, and whiskers represent the largest and
smallest values. Intra-study MARV tests using all available features give the largest early
warning times, and the intra-study EBOV test showed the worst performance. Cross-study
validations, including the aggregate study that considered all data over all
studies, have very similar performance, suggesting algorithm robustness against virus
strain and exposure routes. Reducing the feature set systematically degrades algorithm
performance; the best performance is observed using all available ECG-derived
features, and the worst performance when only heart rate is considered. This suggests
that subtle electrophysiological features in the ECG signal (PR, QT, QRS intervals, etc.)
are some of the most discriminating for our classification algorithm.